Signatures of T cell immunity revealed using sequence similarity with TCRDivER algorithm

Changes in T cell receptor (TCR) repertoires have become important markers for monitoring disease or therapy progression. With the rise of immunotherapy in cancer, infectious and autoimmune disease, accurate assessment and comparison of the "state" of the TCR repertoire has become paramount. One important driver of change within the repertoire is T cell proliferation following immunisation. One way of monitoring this is to investigate large clones of individual T cells believed to bind epitopes connected to the disease. However, as a single target can be bound by many different TCRs, monitoring individual clones cannot fully account for T cell cross-reactivity. Moreover, T cells responding to the same target often exhibit higher sequence similarity, which highlights the importance of accounting for TCR similarity within the repertoire. This complexity of binding relationships between a TCR and its target complicates comparison of immune responses between individuals and comparison of TCR repertoires at different timepoints. Here we propose the TCRDivER algorithm (T cell Receptor Diversity Estimates for Repertoires), a global method of T cell repertoire comparison using diversity profiles sensitive to both clone size and sequence similarity. This approach allowed for distinction between spleen TCR repertoires of immunised and non-immunised mice, showing the need to include both facets of repertoire change simultaneously. The analysis revealed biologically interpretable relationships between sequence similarity and clonality. These aid in understanding the differences between, and separation of, repertoires stemming from different biological contexts. With the growing availability of sequencing data, we expect our tool to find broad usage in clinical and research applications.


Supplementary Note 1: TCRDivER and Diversity Features
Supplementary Figure 1: An overview of the calculation of the distance matrix. The list of CDR3 sequences is divided into lists of equal length (chunks); 10 sequences are shown here, and the default value is 100. The 10 CDR3s of a chunk are then compared pairwise with all the other CDR3s in the total list of CDR3 sequences. Each portion of the distance matrix, i.e. each chunk, has 10 rows and $S$ columns, where $S$ corresponds to the total number of CDR3 clone sequences. In the end there are $n$ chunks, where $n$ is equal to the floored division of the total number of sequences by the length of a chunk, $n = \lfloor S / \text{length of chunk} \rfloor$. The distances $d(\mathrm{CDR3}_i, \mathrm{CDR3}_j)$ are calculated based on the BLOSUM45 alignment score. The diagonal of the combined distance matrix is 0.
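The chunked calculation described in this caption can be sketched as follows. This is a minimal illustration, not the TCRDivER implementation: the function names are hypothetical, the example sequences are made up, and `toy_distance` is a simple mismatch fraction standing in for the BLOSUM45-based distance. The sketch also assumes that a shorter final chunk is kept when $S$ is not a multiple of the chunk length, which the caption does not specify.

```python
from typing import Callable, Iterator, List, Sequence


def toy_distance(a: str, b: str) -> float:
    """Stand-in distance (NOT BLOSUM45): fraction of mismatching positions
    after right-padding the shorter sequence with '-'."""
    width = max(len(a), len(b))
    if width == 0:
        return 0.0
    padded_a, padded_b = a.ljust(width, "-"), b.ljust(width, "-")
    return sum(x != y for x, y in zip(padded_a, padded_b)) / width


def distance_chunks(
    cdr3s: Sequence[str],
    chunk_len: int = 100,
    dist: Callable[[str, str], float] = toy_distance,
) -> Iterator[List[List[float]]]:
    """Yield the distance matrix one chunk at a time.

    Each chunk has up to `chunk_len` rows and len(cdr3s) columns, so only
    one chunk needs to be held in memory at once.
    """
    for start in range(0, len(cdr3s), chunk_len):
        rows = cdr3s[start:start + chunk_len]
        yield [[dist(a, b) for b in cdr3s] for a in rows]


def combine_chunks(chunks) -> List[List[float]]:
    """Stack the chunks row-wise into the full S x S distance matrix."""
    matrix: List[List[float]] = []
    for chunk in chunks:
        matrix.extend(chunk)
    return matrix


# Made-up example sequences, chunked two rows at a time.
cdr3s = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSPDRGNTLYF",
         "CAWSLGGNQDTQYF", "CASGDAGGYEQYF"]
chunks = list(distance_chunks(cdr3s, chunk_len=2))
matrix = combine_chunks(chunks)
```

Because every chunk is computed independently of the others, the chunks could also be written to disk or distributed across workers before being combined.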
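Supplementary Note 1.1: Evaluation of naive diversity of the first order

$$D(q) = \left( \sum_{i} p_i^{q} \right)^{\frac{1}{1-q}} \tag{1}$$

Exploring the limit as $q$ approaches 1 gives an indeterminate form, which allows us to apply L'Hôpital's rule:

$$\lim_{q \to 1} \ln(D(q)) = \lim_{q \to 1} \frac{\ln \sum_{i} p_i^{q}}{1-q} = \lim_{q \to 1} \frac{\sum_{i} p_i^{q} \ln p_i \,/\, \sum_{i} p_i^{q}}{-1}$$

The solution is:

$$\lim_{q \to 1} \ln(D(q)) = -\sum_{i} p_i \ln p_i$$

This is equivalent to:

$$\lim_{q \to 1} D(q) = \exp\left(-\sum_{i} p_i \ln p_i\right) = \prod_{i} p_i^{-p_i}$$

Supplementary Note 1.2: Evaluating the slope at q = 1

We start with the definition of ${}^{q}D$:

$$D(q) = \left( \sum_{i} p_i^{q} \right)^{\frac{1}{1-q}}$$

Now we can write

$$\ln(D(q)) = \frac{1}{1-q} \ln S$$

where, in this note, $S = \sum_{i} p_i^{q}$ denotes the sum itself rather than the number of clones. Let $'$ denote differentiation with respect to $q$. Now evaluate

$$\ln(D(q))' = \frac{1}{(1-q)^2} \ln S + \frac{1}{1-q} (\ln S)'$$

$$\ln(D(q))' = \frac{\ln S + (1-q)(\ln S)'}{(1-q)^2} \tag{9}$$

To evaluate the limit as $q$ goes to 1 we need to apply L'Hôpital's rule twice. Calling the numerator $t$:

$$t = \ln S + (1-q)(\ln S)' \tag{10}$$

$$t' = (1-q)(\ln S)'', \qquad t'' = -(\ln S)'' + (1-q)(\ln S)'''$$

The second derivative of the denominator $(1-q)^2$ is 2. Since $\ln S$ and all its derivatives are finite as $q$ goes to 1,

$$\lim_{q \to 1} \ln(D(q))' = \lim_{q \to 1} \frac{t''}{2} = -\frac{1}{2} \lim_{q \to 1} (\ln S)''$$

We need

$$(\ln S)'' = \frac{S''}{S} - \left(\frac{S'}{S}\right)^2, \qquad S' = \sum_{i} p_i^{q} \ln p_i, \qquad S'' = \sum_{i} p_i^{q} (\ln p_i)^2$$

Because $p_i$ is a probability distribution, $\lim_{q \to 1} S = \sum_{i} p_i = 1$, which means (since the limit of a quotient is the quotient of the limits)

$$\lim_{q \to 1} (\ln S)'' = \sum_{i} p_i (\ln p_i)^2 - \left(\sum_{i} p_i \ln p_i\right)^2$$

Finally, we want to evaluate $D(q)'$ and then take the limit as $q$ goes to 1.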
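$$D(q)' = D(q) \, \ln(D(q))'$$

and

$$\lim_{q \to 1} D(q)' = -\frac{1}{2} \exp\left(-\sum_{i} p_i \ln p_i\right) \left[ \sum_{i} p_i (\ln p_i)^2 - \left(\sum_{i} p_i \ln p_i\right)^2 \right]$$

To generalise to the case $Z \neq I$ we simply have to replace $S$ with

$$S = \sum_{i} p_i (Zp)_i^{q-1}$$

In this case

$$\lim_{q \to 1} \ln(D(q))' = -\frac{1}{2} \left[ \sum_{i} p_i \left(\ln (Zp)_i\right)^2 - \left(\sum_{i} p_i \ln (Zp)_i\right)^2 \right]$$

Supplementary Note 1.3: Evaluation of similarity scaled diversity of the first order

We start with:

$$D(q, \lambda) = \left( \sum_{i} p_i (Zp)_i^{q-1} \right)^{\frac{1}{1-q}}, \qquad (Zp)_i = \sum_{j} Z_{ij} p_j$$

Rewriting the equation, calculating the limit as $q \to 1$ and applying L'Hôpital's rule:

$$\lim_{q \to 1} \ln(D(q, \lambda)) = \lim_{q \to 1} \frac{\ln \sum_{i} p_i (Zp)_i^{q-1}}{1-q} = \lim_{q \to 1} \frac{\sum_{i} p_i (Zp)_i^{q-1} \ln (Zp)_i \,/\, \sum_{i} p_i (Zp)_i^{q-1}}{-1}$$

The result is:

$$\lim_{q \to 1} \ln(D(q, \lambda)) = -\sum_{i} p_i \ln (Zp)_i$$

which is equivalent to:

$$\lim_{q \to 1} D(q, \lambda) = \exp\left(-\sum_{i} p_i \ln (Zp)_i\right) = \prod_{i} (Zp)_i^{-p_i}$$

We would like to note that the use of L'Hôpital's rule has been established in the literature to link scaled diversity measures to Shannon entropy [40].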
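The first-order limits and the slope at $q = 1$ can be checked numerically. The sketch below is illustrative only: the function names, the example frequencies and the example similarity matrix are assumptions, not part of TCRDivER. It evaluates the diversity just above $q = 1$, compares it with the exponential of the Shannon-type entropy, and compares a finite-difference slope with the closed-form expression.

```python
import math


def similarity_diversity(p, Z, q):
    """D = (sum_i p_i (Zp)_i^(q-1))^(1/(1-q)); uses the q -> 1 limit at q == 1."""
    zp = [sum(z_ij * p_j for z_ij, p_j in zip(row, p)) for row in Z]
    if q == 1:
        return math.exp(-sum(p_i * math.log(zp_i) for p_i, zp_i in zip(p, zp)))
    total = sum(p_i * zp_i ** (q - 1) for p_i, zp_i in zip(p, zp))
    return total ** (1.0 / (1.0 - q))


def naive_diversity(p, q):
    """Naive diversity is the special case Z = I."""
    n = len(p)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return similarity_diversity(p, identity, q)


def naive_slope_at_one(p):
    """Closed-form limit of D'(q) as q -> 1 for naive diversity."""
    mean_log = sum(p_i * math.log(p_i) for p_i in p)
    mean_sq_log = sum(p_i * math.log(p_i) ** 2 for p_i in p)
    return -0.5 * math.exp(-mean_log) * (mean_sq_log - mean_log ** 2)


# Hypothetical three-clone repertoire and similarity matrix.
p = [0.5, 0.3, 0.2]
Z = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

h = 1e-4
finite_difference_slope = (naive_diversity(p, 1 + h) - naive_diversity(p, 1 - h)) / (2 * h)
```

The finite-difference slope is negative, as expected for a diversity profile that decreases with increasing $q$.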
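Supplementary Note 1.4: Evaluation of naive diversity of the infinity order

We start with the formula for naive diversity and extract the largest clone frequency $p_{max}$:

$$D(q) = \left( \sum_{i} p_i^{q} \right)^{\frac{1}{1-q}} = \left[ p_{max}^{q} \left( 1 + \sum_{j \neq max} \tilde{p}_j^{\,q} \right) \right]^{\frac{1}{1-q}}$$

where $\tilde{p}_j = p_j / p_{max}$ for $j \neq max$, and $p_{max}$ is represented by the first term of the sum. Since a limit of products is a product of limits, it follows:

$$\lim_{q \to \infty} D(q) = \lim_{q \to \infty} p_{max}^{\frac{q}{1-q}} \cdot \lim_{q \to \infty} \left( 1 + \sum_{j \neq max} \tilde{p}_j^{\,q} \right)^{\frac{1}{1-q}}$$

The first limit is evaluated as:

$$\lim_{q \to \infty} p_{max}^{\frac{q}{1-q}} = p_{max}^{-1}$$

The second limit is evaluated by taking the logarithm:

$$\ln \left( 1 + \sum_{j \neq max} \tilde{p}_j^{\,q} \right)^{\frac{1}{1-q}} = \frac{1}{1-q} \ln \left( 1 + \sum_{j \neq max} \tilde{p}_j^{\,q} \right)$$

Since $0 < \sum_{j \neq max} \tilde{p}_j^{\,q} < 1$ for sufficiently large $q$ (every $\tilde{p}_j < 1$), the bounds of the logarithm are:

$$0 < \ln \left( 1 + \sum_{j \neq max} \tilde{p}_j^{\,q} \right) < \ln 2$$

which gives:

$$\lim_{q \to \infty} \frac{1}{1-q} \ln \left( 1 + \sum_{j \neq max} \tilde{p}_j^{\,q} \right) = 0, \qquad \lim_{q \to \infty} D(q) = \frac{1}{p_{max}}$$

Supplementary Note 1.5: Evaluation of similarity scaled diversity of the infinity order

We start with:

$$D(q, \lambda) = \left( \sum_{i} p_i (Zp)_i^{q-1} \right)^{\frac{1}{1-q}} = \left[ p_{max} (Zp)_{max}^{q-1} \left( 1 + \sum_{j \neq max} \tilde{p}_j (\widetilde{Zp})_j^{\,q-1} \right) \right]^{\frac{1}{1-q}}$$

where the term that has been pulled out is the one for which $(Zp)_i$ is maximum. The $p_{max}$ is the corresponding $p_i$. As before, the $\tilde{p}_j$ are defined as $p_j / p_{max}$ for $j \neq max$, and $(\widetilde{Zp})_j$ is defined as $(Zp)_j / (Zp)_{max}$. Again, the limit splits into two factors:

$$\lim_{q \to \infty} D(q, \lambda) = \lim_{q \to \infty} \left[ p_{max}^{\frac{1}{1-q}} (Zp)_{max}^{-1} \right] \cdot \lim_{q \to \infty} \left( 1 + \sum_{j \neq max} \tilde{p}_j (\widetilde{Zp})_j^{\,q-1} \right)^{\frac{1}{1-q}} \tag{*}$$

Taking the log of the second term gives:

$$\frac{1}{1-q} \ln \left( 1 + \sum_{j \neq max} \tilde{p}_j (\widetilde{Zp})_j^{\,q-1} \right)$$

and now, because $(\widetilde{Zp})_j \leq 1$, the log is bounded for $q \geq 1$ by:

$$0 \leq \ln \left( 1 + \sum_{j \neq max} \tilde{p}_j (\widetilde{Zp})_j^{\,q-1} \right) \leq \ln \left( 1 + \sum_{j \neq max} \tilde{p}_j \right) = -\ln p_{max}$$

so again the limit of the log of the second factor in (*) is 0, and the limit of the factor itself is 1. The end result is:

$$\lim_{q \to \infty} D(q, \lambda) = \frac{1}{(Zp)_{max}}$$

which reduces to the correct limit when $Z = I$, which is the naive diversity.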
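A quick numerical check of the infinity-order limits is sketched below. The function names, the frequencies and the similarity matrix are hypothetical examples chosen for illustration. A large but finite order ($q = 200$) is compared with $1/(Zp)_{max}$, which for $Z = I$ is $1/p_{max}$.

```python
def similarity_diversity(p, Z, q):
    """D = (sum_i p_i (Zp)_i^(q-1))^(1/(1-q)), valid for q != 1."""
    zp = [sum(z_ij * p_j for z_ij, p_j in zip(row, p)) for row in Z]
    total = sum(p_i * zp_i ** (q - 1) for p_i, zp_i in zip(p, zp) if p_i > 0)
    return total ** (1.0 / (1.0 - q))


def infinity_order_limit(p, Z):
    """Closed-form limit for q -> infinity: 1 / max_i (Zp)_i over present clones."""
    zp = [sum(z_ij * p_j for z_ij, p_j in zip(row, p)) for row in Z]
    return 1.0 / max(zp_i for p_i, zp_i in zip(p, zp) if p_i > 0)


# Hypothetical three-clone repertoire.
p = [0.5, 0.3, 0.2]
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
Z = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

LARGE_Q = 200  # large enough to approach the limit without floating-point underflow
naive_large_q = similarity_diversity(p, identity, LARGE_Q)
scaled_large_q = similarity_diversity(p, Z, LARGE_Q)
```

Much larger orders make the powers underflow to zero in double precision, so a production implementation would more safely work in log space or use the closed-form limit directly.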
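We start with the assumption that $\lambda$ is around 0, where $\ln(D(q, \lambda))$ is:

$$\ln(D(q, \lambda)) = \frac{1}{1-q} \ln \left( \sum_{i=1}^{S} p_i \left( \sum_{j=1}^{S} e^{-\lambda d_{ij}} p_j \right)^{q-1} \right)$$

For $\lambda \to 0$, by applying the Taylor expansion, $e^{-\lambda d_{ij}}$ reduces to $1 - \lambda d_{ij}$, which gives:

$$\sum_{j=1}^{S} e^{-\lambda d_{ij}} p_j \approx \sum_{j=1}^{S} (1 - \lambda d_{ij}) p_j$$

We can then rewrite:

$$\sum_{j=1}^{S} (1 - \lambda d_{ij}) p_j = 1 - \lambda \sum_{j=1}^{S} d_{ij} p_j$$

By applying the binomial expansion we arrive at:

$$\left( 1 - \lambda \sum_{j=1}^{S} d_{ij} p_j \right)^{q-1} \approx 1 - (q-1) \lambda \sum_{j=1}^{S} d_{ij} p_j$$

By substituting the derived expressions in the formula for $\ln(D(q, \lambda))$ and keeping in mind that $\sum_{i=1}^{S} p_i = 1$, we can write:

$$\ln(D(q, \lambda)) \approx \frac{1}{1-q} \ln \left( 1 - (q-1) \lambda \sum_{i=1}^{S} \sum_{j=1}^{S} p_i p_j d_{ij} \right)$$

By applying the linear approximation $\ln(1 - x) \approx -x$, we finally arrive at:

$$\ln(D(q, \lambda)) \approx \lambda \sum_{i=1}^{S} \sum_{j=1}^{S} p_i p_j d_{ij}$$

Note that the final form of the evaluation of $D(q, \lambda)$ for $\lambda \to 0$ is independent of the order of diversity $q$. It is solely dependent on the distances between CDR3 sequences weighted by their respective frequencies.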
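The small-$\lambda$ approximation can be verified numerically by comparing the exact $\ln(D(q, \lambda))$ with $\lambda \sum_i \sum_j p_i p_j d_{ij}$ for several orders $q$. The sketch below assumes $Z_{ij} = e^{-\lambda d_{ij}}$ as in the expansion above; the function names, frequencies and distance matrix are hypothetical examples.

```python
import math


def log_diversity(p, d, q, lam):
    """Exact ln D(q, lambda) with similarity Z_ij = exp(-lambda * d_ij)."""
    zp = [sum(math.exp(-lam * d_ij) * p_j for d_ij, p_j in zip(row, p)) for row in d]
    if q == 1:
        # q -> 1 limit of the similarity scaled diversity
        return -sum(p_i * math.log(zp_i) for p_i, zp_i in zip(p, zp))
    total = sum(p_i * zp_i ** (q - 1) for p_i, zp_i in zip(p, zp))
    return math.log(total) / (1 - q)


def small_lambda_approximation(p, d, lam):
    """First-order approximation: lambda * sum_ij p_i p_j d_ij (no q dependence)."""
    n = len(p)
    return lam * sum(p[i] * p[j] * d[i][j] for i in range(n) for j in range(n))


# Hypothetical three-clone repertoire with a symmetric distance matrix.
p = [0.5, 0.3, 0.2]
d = [[0.0, 0.2, 0.6],
     [0.2, 0.0, 0.4],
     [0.6, 0.4, 0.0]]
lam = 1e-3

approx = small_lambda_approximation(p, d, lam)
errors = {q: abs(log_diversity(p, d, q, lam) - approx) for q in (0, 1, 2, 5)}
```

The residual errors are of second order in $\lambda$ and grow slowly with $q$, which is why the approximation is only used for small $\lambda$.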
Supplementary Note 1.7: Evaluation of ∆ln(D(q, λ)) for small λ and its relationship to distance

By evaluating ∆ln(D(q, λ)) for two values of small λ, where $\lambda' > \lambda$, we arrive at:

$$\Delta \ln(D(q, \lambda)) = \ln(D(q, \lambda')) - \ln(D(q, \lambda)) \approx (\lambda' - \lambda) \sum_{i=1}^{S} \sum_{j=1}^{S} p_i p_j d_{ij}$$

It is evident that ∆ln(D(q, λ)) is linearly dependent on the distances between CDR3s and their probabilities. In the case of two hypothetical repertoires, I and II, which have a uniform distribution of CDR3 frequencies within the repertoire, $p^{I}_i = p^{II}_i = p$, and distances between CDR3s $d^{I}_{ij} > d^{II}_{ij}$, ∆ln(D(q, λ)) for repertoire I is larger than ∆ln(D(q, λ)) for repertoire II. That is, as the similarity between CDR3s increases, the area between the curves for small λs decreases. Alternatively, if the distances between CDR3s of the two repertoires are the same, $d^{I}_{ij} = d^{II}_{ij} = d$, and the distribution is still uniform, but the number of clones differs so that repertoire I has fewer clones than II, i.e. $p^{I}_i > p^{II}_i$, then ∆ln(D(q, λ)) for repertoire I is larger than ∆ln(D(q, λ)) for repertoire II. This means that repertoires with more abundant clones have a larger ∆ln(D(q, λ)) for small λs.
Supplementary Note 1.8: Evaluation of ∆ln(D(q, λ)) for larger λs and its relationship to distance

In order to evaluate the relationship between CDR3 clone distance and the area between the curves of larger λs, we constructed three mock repertoires. The repertoires consist of 100 CDR3s that are uniformly distributed in the repertoire, i.e. $p_i = 1/S = 1/100 = 0.01$. For each mock repertoire a mock distance matrix was calculated so that the distances between the CDR3s within a repertoire were equal, but differed between the repertoires. The distances were $d^{I}_{ij} = 0.05$, $d^{II}_{ij} = 0.1$ and $d^{III}_{ij} = 0.5$ for repertoires I, II and III respectively when $i \neq j$, and $d_{ij} = 0$ for $i = j$. Individual λ curves of the diversity profiles are straight lines, a remnant of the uniform distribution of CDR3 frequencies in the repertoire (Supplementary Figure 2).

Supplementary Figure 2: Effect of CDR3 distance shown in three mock repertoires with a uniform distribution of 100 CDR3 clones in the repertoire. A. Schematic representation of the three mock repertoires with the distances $d_{ij}$ between CDR3s increasing from repertoire I to III. B. Diversity profiles calculated based on the probability distribution and $d_{ij}$ for CDR3s in the mock repertoires. Since all the repertoires consist of 100 uniformly distributed CDR3s, the frequency of seeing each CDR3 clone is $p_i = 1/100 = 0.01$. C. Calculated values of the average ∆ln(D(q, λ)) for small λs and the calculated area between the λ = identity and λ = 16 curves for the three repertoires, shown left to right respectively.

Supplementary Figure 10: Principal Components Analysis on true diversity D(q, λ) calculated for the murine dataset using the BLOSUM45 distance for CDRβ3. The aspect ratio corresponds to variation found by PCA.
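The three mock repertoires of Supplementary Note 1.8 can be reproduced with a short sketch. It assumes $Z_{ij} = e^{-\lambda d_{ij}}$; the function names and the two small λ values (0.1 and 0.2) are hypothetical choices for illustration and are not the values used to compute the reported features. The sketch checks that each λ curve is flat in $q$ for a uniform repertoire and that ∆ln(D(q, λ)) for small λ grows with the CDR3 distance.

```python
import math


def log_diversity(p, d, q, lam):
    """Exact ln D(q, lambda) with similarity Z_ij = exp(-lambda * d_ij)."""
    zp = [sum(math.exp(-lam * d_ij) * p_j for d_ij, p_j in zip(row, p)) for row in d]
    if q == 1:
        return -sum(p_i * math.log(zp_i) for p_i, zp_i in zip(p, zp))
    total = sum(p_i * zp_i ** (q - 1) for p_i, zp_i in zip(p, zp))
    return math.log(total) / (1 - q)


def mock_repertoire(n_clones, distance):
    """Uniform frequencies and one constant off-diagonal distance."""
    p = [1.0 / n_clones] * n_clones
    d = [[0.0 if i == j else distance for j in range(n_clones)]
         for i in range(n_clones)]
    return p, d


SMALL_LAMBDAS = (0.1, 0.2)  # hypothetical small lambda values
MOCK_DISTANCES = {"I": 0.05, "II": 0.1, "III": 0.5}

delta_small = {}  # delta ln D between the two small lambdas
flatness = {}     # |ln D(q=0) - ln D(q=2)| at lambda = 16
for name, distance in MOCK_DISTANCES.items():
    p, d = mock_repertoire(100, distance)
    low, high = SMALL_LAMBDAS
    delta_small[name] = log_diversity(p, d, 1, high) - log_diversity(p, d, 1, low)
    flatness[name] = abs(log_diversity(p, d, 0, 16.0) - log_diversity(p, d, 2, 16.0))
```

For a uniform repertoire every $(Zp)_i$ takes the same value, so the diversity does not depend on $q$, which is why the individual λ curves appear as straight horizontal lines.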
Supplementary Note 2.4: Diversity profiles of murine dataset calculated using Atchley factor CDRβ3 distances

Graphs showing relationships between some of the divP features of the murine dataset with random frequencies. From left to right: average ∆ln(D(q, λ)) for small λs is shown versus the slope of q = 0 → 1 for λ = 64.0; average ∆ln(D(q, λ)) for small λs is shown versus the area between the curves of λ = identity and λ = 16.0; slope of q = 1 → 2 for λ = identity (i.e. naive diversity) is shown versus the slope of q = 0 → 1 for λ = 64.0; slope of q = 1 → 2 for λ = identity (i.e. naive diversity) is shown versus the area between the curves of λ = identity and λ = 64.0.
Supplementary Note 3.5: Features of diversity profiles from the randomised human dataset with random frequencies with BLOSUM45 as CDRβ3 distance

Supplementary Figure 31: Graphs showing relationships between some of the divP features of the human randomised dataset. a. average ∆ln(D(q, λ)) for small λs is shown versus the area between the curves of λ = identity and λ = 16.0; b. average ∆ln(D(q, λ)) for small λs is shown versus the slope of q = 0 → 1 for λ = 64.0; c. slope of q = 1 → 2 for λ = identity (i.e. naive diversity) is shown versus the area between the curves of λ = identity and λ = 64.0; d. slope of q = 1 → 2 for λ = identity (i.e. naive diversity) is shown versus the slope of q = 0 → 1 for λ = 64.0.